ProTInSeq: transposon insertion tracking by ultra-deep DNA sequencing to identify translated large and small ORFs

Identifying which open reading frames (ORFs) are translated is not a trivial task. ProTInSeq is a technique designed to characterize proteomes by sequencing transposon insertions engineered to express a selection marker when they occur in-frame within a protein-coding gene. In the bacterium Mycoplasma pneumoniae, ProTInSeq identifies 83% of its annotated proteins, along with 5 proteins and 153 small ORF-encoded proteins (SEPs; ≤100 aa) that were not previously annotated. Moreover, ProTInSeq can be used to detect translational noise, as well as for relative quantification and transmembrane topology estimation of fitness and non-essential proteins. By integrating various identification approaches, the number of annotated SEPs in this bacterium increases from the initial 27 to 329, with a quarter of them predicted to possess antimicrobial potential. Herein, we describe a methodology, complementary to Ribo-Seq and mass spectrometry, that can identify SEPs while providing other insights into a proteome with a flexible and cost-effective ultra-deep DNA sequencing approach.

vector or synthetic gene (for EryAmut fragment) and the primers described in Table S1. The inverted repeat sequences are not mutated, but the vector contains EryA*. This vector is the negative control to confirm that the three stop codons of the inverted repeat sequence abort translation. This vector is also used to evaluate the rate of spontaneous mutants that can be resistant to erythromycin.
The vector TnP 438 barIR* was obtained by Gibson assembly of the A1, B1 and C1 fragments, which were amplified by PCR using as template the synthetic gene (B1 fragment) or the TnP 438 catIR* vector (A1 and C1 fragments) and the primers described in Table S1. Tnbar*IR* was obtained by Gibson assembly of the A2, B2 and C1 fragments, which were amplified by PCR using as template the ordered synthetic gene (B2 fragment) or the TnP 438 catIR* vector (A2 and C1 fragments) and the primers described in Supplementary Table 1.

Relating transposon analysis with transmembrane segments prediction in new and annotated ORFs
Although the difference was not significant, owing to the limited sizes evaluated and the small number of new SEPs predicted to have transmembrane segments (n=39), in the most highly selected samples (CmB15) an average of 63% ± 27% of the insertions found in these smORFs were located in the cytoplasmic segments predicted by TMHMM, which is lower than the value obtained for known transmembrane NE genes (81% ± 17%, one-tailed T-test P=0.12). After running the algorithm on 101 known NE proteins (35 lipoproteins, 66 transmembrane), our results matched the TMHMM predictions, with an error of ± 10 aa, for 41 proteins and failed in one segment for 39 (31 predicted to be cytoplasmic that are exposed in the TMHMM predictions; 8 predicted to be cytoplasmic by TMHMM but found clean of insertions). For the remaining 21 we could not make any prediction: 13 presented repeated regions, and the other 8 presented at least three transmembrane segments and small non-transmembrane segments, so their in-frame coverage was considerably reduced (21% ± 8%, one-tailed T-test P=0.12), which prevented the efficient application of the algorithm.
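The percentage of insertions falling in predicted cytoplasmic segments can be illustrated with a minimal sketch. It assumes a per-residue topology string in which 'i' marks cytoplasmic residues, 'M' membrane residues and 'o' exposed residues, and in-frame insertions expressed as 1-based residue indices; the function name and this encoding are hypothetical and are not taken from the study's pipeline.

```python
def cytoplasmic_fraction(topology, insertion_residues):
    """Fraction of insertions that fall on residues labeled cytoplasmic.

    topology: per-residue labels (assumed encoding: 'i' cytoplasmic,
              'M' membrane, 'o' exposed), one character per amino acid.
    insertion_residues: 1-based residue indices of in-frame insertions.
    Returns None when there are no insertions to evaluate.
    """
    if not insertion_residues:
        return None
    for residue in insertion_residues:
        if residue < 1 or residue > len(topology):
            raise ValueError(f"residue {residue} is outside the protein")
    inside = sum(1 for r in insertion_residues if topology[r - 1] == "i")
    return inside / len(insertion_residues)


# Hypothetical 40-aa protein: 10 cytoplasmic, 20 membrane, 10 exposed residues.
example_topology = "i" * 10 + "M" * 20 + "o" * 10
print(cytoplasmic_fraction(example_topology, [2, 5, 9, 35]))  # 0.75
```

Averaging this value over a set of proteins, together with its standard deviation, would give summary figures of the kind quoted above.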
but accounting for putative annotations (every smORF and ORF not included in the NCBI annotation) and regions with no associated annotation (non-coding), respectively. These categories, except for non-coding, consider only the first base of every three bases, as labeled in Supplementary Data 2.
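As an illustration of the "first base of every three bases" rule, the sketch below lists the genome positions that would be considered for one ORF. The function name, the 1-based inclusive coordinates and the handling of the minus strand (counting codons from the ORF end) are assumptions made for this example, not details confirmed by the text.

```python
def first_codon_bases(start, end, strand="+"):
    """Return the genome positions of the first base of each codon.

    start, end: 1-based inclusive ORF coordinates (start <= end).
    strand: '+' counts codons from start; '-' counts them from end
            (assumed convention for reverse-strand ORFs).
    """
    length = end - start + 1
    if length <= 0 or length % 3 != 0:
        raise ValueError("ORF length must be a positive multiple of 3")
    if strand == "+":
        return list(range(start, end + 1, 3))
    if strand == "-":
        return list(range(end, start - 1, -3))
    raise ValueError("strand must be '+' or '-'")


print(first_codon_bases(10, 18))        # [10, 13, 16]
print(first_codon_bases(10, 18, "-"))   # [18, 15, 12]
```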

Supplementary Data 4. Combined coverage and read count values by sample for each position type in M. pneumoniae.
For each transposon library sequenced in this study (column A), separated by the labels annotated, putative, non-coding, and E and NE genes from the "gold" set (column B), we include the number of nucleobases considered (column C), the number of insertions found in those positions (column D), coverage (the ratio between columns D and C; column E), total read count (column F), average number of reads per insertion (column G) and standard deviation (column H). These values are used to define main Figures 2A and 2C, and Extended Figure 1.

Supplementary Data 5. Paired statistical evaluation of the selection in different libraries and considering different position types in M. pneumoniae.
Statistical comparison by one-tailed Mann-Whitney U test between transposon libraries, separated by the labels annotated, putative, non-coding, and E and NE genes from the "gold" set. Column A includes an identifier in the format library_selection_concentration_anntype1_vs_anntype2_metric. The metrics compared, in column B, can be either coverage (cov) or mean read count (mean_r). Column C includes the antibiotic concentration used to grow the cultures (note that the Barnase library has no concentration assigned, and 'na' is included in those cases). The following columns include the metrics compared between two labeled base group types (1 and 2), showing the group identifiers (columns D and H); the labels of the annotation types (columns E and I), which can be annotated, non-coding, putative, or the E and NE gold sets of genes; the average values (columns F and J) and standard deviations (columns G and K) used in the calculation of the p-value using a one-tailed Mann-Whitney U test (column L).

Supplementary Data 6. Available knowledge on M. pneumoniae M129 ORFome.
This table includes all the available information on the 30,112 sequences that could constitute a coding sequence in M. pneumoniae. For each identifier (column B), we include coordinates and nucleotide and amino acid lengths (columns C-H). Column I includes the gene name when the entry is annotated in M. pneumoniae. Localization and function are described in columns J and K. Column L includes the number of the operon in which the annotation would be expressed. We also include transcription-related information: average expression (column M; as log2(gene read count/gene length)) and estimated average RNA copies per cell (column N), considering 4 RNA sequencing samples covering different growth times (6, 24 and 48 hours; ArrayExpress identifier E-MTAB-6203). Column O gives the number of mass spectrometry experiments detecting that entry (to a maximum of 116) and column P gives the total number of unique tryptic peptides detected. This is available for the 12,426 sequences with an amino acid length ≥19 (from 116 mass spectrometry experiments, PRIDE ID: PXD008243). Columns Q to T recapitulate protein copies per cell under different conditions (overall, extracting with urea, extracting with SDS and mean, respectively). Column U includes the half-lives of the proteins.
Columns V and W describe the reference density of insertion and the essentiality assigned in previous studies.
Columns X and Y include the predicted RanSEPs score and ribosome binding site presence. Column Z contains homology information measured against a database of smORFs from >100 bacterial species obtained in Miravet et al. 2019. This includes seven groups: 0, no hits passed the defined thresholds; 1, conserved with an annotated function; 2, conserved as an annotated SEP but with no associated function; 3, conserved in a different species but target and homologous sequence not found in NCBI; 4, sequence is completely or partially (> 75%) repeated ≥ 3 times in the reference genome; 5, potential pseudogene; and 6, annotations already found in the NCBI reference annotations. Column AA includes the expected function provided by this homology search. Columns AB to AD cover the output provided by Phobius, including the number of transmembrane segments, presence of signal peptide and transmembrane topology predicted by TM-HMM.
Column AE includes the complex information, where 1 implies that the entry is functional as a monomer, 2 as a dimer, and so on. Finally, columns AF-AH are 1 if the protein is a Lon protease target, a lipoprotein, and/or a truncated gene or pseudogene, respectively, and 0 otherwise.
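The per-library summary values described for Supplementary Data 4 (coverage as insertions over positions considered, total read count, and mean and standard deviation of reads per insertion) can be sketched as follows. The function name, the dictionary keys and the use of the sample standard deviation are assumptions for illustration; the text does not state which estimator was used.

```python
from statistics import mean, stdev


def summarize_positions(n_positions, reads_per_insertion):
    """Summarize one library/label group of positions.

    n_positions: number of nucleobases considered (column C in the legend).
    reads_per_insertion: read count of each insertion found in them.
    """
    if n_positions <= 0:
        raise ValueError("n_positions must be positive")
    n_insertions = len(reads_per_insertion)
    return {
        "positions": n_positions,
        "insertions": n_insertions,
        # Coverage is the ratio of inserted positions to positions considered.
        "coverage": n_insertions / n_positions,
        "total_reads": sum(reads_per_insertion),
        "mean_reads": mean(reads_per_insertion) if n_insertions else 0.0,
        # Sample standard deviation (assumed); needs at least two values.
        "sd_reads": stdev(reads_per_insertion) if n_insertions > 1 else 0.0,
    }


summary = summarize_positions(10, [2, 4, 6])
print(summary["coverage"], summary["total_reads"], summary["mean_reads"])
```

Two such groups could then be compared with a one-tailed Mann-Whitney U test, as described for Supplementary Data 5.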

Supplementary Table 1. Details on the construction of each mini-transposon vector.
vector with the mutated barnase gene (bar*, no promoter and no start codon), flanked by IR*. Viable cells after transformation with this vector should not have proteins fused in frame to the barnase gene. The insertions detected with this transformation should not be found in the transformations with Tncat*IR* or TnEryA*IR*. In addition, the P 438 cat cassette is cloned downstream of the barnase gene to select for the transformed cells.
mini-transposon vector with the genetic cassette P Syn EryA flanked by IR* sequences. This vector is the positive control of the transformation with TnEryA*IR*. It is also a control to ensure that the mutation in the IR does not affect the efficiency of the transposition (by comparing with the TnP Syn EryA vector).
mini-transposon vector with the genetic cassette P 438 cat (chloramphenicol acetyltransferase gene under the P438 constitutive promoter of the Mycoplasma genitalium gene mg438 [39]), flanked by the inverted repeat sequences (IR). It is the positive control in the transformation.
TnP 438 cat IR*: mini-transposon vector with the genetic cassette P 438 cat flanked by mutated inverted repeat sequences (IR*). This vector is the positive control of the transformation with TnCat*IR*. It is also a control to ensure that the mutation in the IR does not affect the efficiency of the transposition.
mini-transposon vector with the genetic cassette P Syn EryA (macrolide 2'-phosphotransferase gene under the synthetic promoter [32]), flanked by IR sequences. It is the positive control in the transformation.